A Proof of Lemma 1

Neural Information Processing Systems

Liu et al. [2018] first showed that stationary importance sampling methods can be viewed as a Rao-Blackwellization of the IS estimator, and claimed that the expectation of the likelihood ratios conditioned on the state and action equals the distribution ratio, as stated in Property 1. For completeness, we present a proof of Property 1. Recall the definition of the stationary distribution d. This additional marginalization step over time allows us to consider time-independent distribution ratios. Then, using the law of total expectation together with Assumption 1, we can write the expectation of the second sum in (4) as in (5). Plugging the final expression from (5) back into (4) gives (6). Note that in the infinite-horizon setting, where L → ∞ and n is finite, (6) simplifies further. Similarly, by generalizing this pattern, it can be observed that unrolling n times yields the corresponding expression.

For all experiments, we use the domains and algorithm implementations from the Caltech OPE Benchmarking Suite (COBS) library by Voloshin et al. [2019]. We include a brief description of each of these domains below; a full description of each can be found in the work by Voloshin et al. [2019].

Graph Environment. The Graph environment is a two-chain environment with 2L states and 2 actions.
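The per-step likelihood ratios underlying Property 1 can be illustrated with a minimal sketch. The two-state policies `pi` and `mu` and the helper `trajectory_is_weight` below are illustrative assumptions, not the paper's implementation:

```python
# Target policy pi and behavior policy mu: action probabilities per state
# (hypothetical two-state, two-action example).
pi = {0: [0.9, 0.1], 1: [0.2, 0.8]}
mu = {0: [0.5, 0.5], 1: [0.5, 0.5]}

def trajectory_is_weight(states, actions):
    """Cumulative product of per-step likelihood ratios rho_t = pi(a_t|s_t) / mu(a_t|s_t)."""
    w = 1.0
    for s, a in zip(states, actions):
        w *= pi[s][a] / mu[s][a]
    return w
```

For the trajectory (s, a) = [(0, 0), (1, 1)], the weight is (0.9/0.5) · (0.8/0.5) = 2.88. Property 1 then says that averaging such cumulative weights conditioned on a fixed (s, a) recovers the time-independent distribution ratio, which is what makes the marginalization step useful.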


7fd3b80fb1884e2927df46a7139bb8bf-Supplemental.pdf

Neural Information Processing Systems

The IDs of the 10 datasets used in this work, as well as the number of examples and features, are provided in Table 1 in the main manuscript. All of the datasets correspond to binary classification problems, with varying degrees of class imbalance. While the prediction is always performed in the logarithmic domain, when evaluating the models we transform both the labels and the model predictions back into their original domain. The loss function used for training and evaluation is the standard root mean-squared error (sklearn.metrics.mean_squared_error). We download the raw data programmatically using the Kaggle API, which produces the file train.tsv.
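The log-domain evaluation protocol described above can be sketched as follows. This is a minimal stdlib-only illustration (the paper itself uses sklearn.metrics.mean_squared_error); the variable names and toy values are assumptions:

```python
import math

def rmse(y_true, y_pred):
    """Root mean-squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Predictions are made in the logarithmic domain...
log_labels = [0.0, 1.0, 2.0]
log_preds = [0.1, 0.9, 2.1]

# ...but evaluation transforms both labels and predictions back
# into their original domain before scoring.
labels = [math.exp(v) for v in log_labels]
preds = [math.exp(v) for v in log_preds]
score = rmse(labels, preds)
```

The key point is that the inverse transform is applied to both sequences before the error is computed, so the reported RMSE is in the original units rather than log units.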




Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding

Xu, Jiamin, Gan, Kyra

arXiv.org Machine Learning

Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective, proving it achieves fast finite-sample convergence: it achieves minimax optimal constant regret for $K=1$ and $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments: JumpRiverswim, FrozenLake and AnyTrading.
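The two ingredients of the abstract, a K-step lookahead Q-value and threshold-gated action selection, can be sketched in tabular form. The tiny MDP, the absence of discounting, and the function names below are illustrative assumptions, not the paper's algorithm:

```python
# Tiny deterministic MDP: P[s][a] = [(prob, next_state)], R[s][a] = reward.
P = {0: {0: [(1.0, 1)], 1: [(1.0, 0)]},
     1: {0: [(1.0, 1)], 1: [(1.0, 1)]}}
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 0.5, 1: 0.5}}

def k_step_q(s, a, K):
    """K-step lookahead Q-value: immediate reward plus the greedily-unrolled
    expected reward over the remaining K-1 steps (planning truncated at K)."""
    q = R[s][a]
    if K > 1:
        for p, s2 in P[s][a]:
            q += p * max(k_step_q(s2, a2, K - 1) for a2 in R[s2])
    return q

def select_action(s, K, threshold):
    """Thresholding: act greedily only when the best K-step lookahead
    value clears the (possibly time-varying) threshold."""
    best = max(R[s], key=lambda a: k_step_q(s, a, K))
    return best if k_step_q(s, best, K) >= threshold else None
```

With K = 2 from state 0, action 0 is worth 1.0 + 0.5 = 1.5, so it is taken under a threshold of 1.2 but rejected under a threshold of 2.0; an adaptive schedule, as the abstract describes, would grow K as estimates sharpen.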


Cohort-Based Active Modality Acquisition

Rheude, Tillmann, Eils, Roland, Wild, Benjamin

arXiv.org Artificial Intelligence

Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on multimodal datasets with up to 15 modalities demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of additional modalities for selected samples compared with methods relying solely on unimodal information, entropy-based guidance, or random selection. We showcase the real-world relevance and scalability of our method by demonstrating its ability to effectively guide the costly acquisition of proteomics data for disease prediction in a large prospective cohort, the UK Biobank (UKBB). Our work provides an effective approach for optimizing modality acquisition at the cohort level, enabling more effective use of resources in constrained settings.
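The cohort-level selection problem can be sketched as a scoring-and-ranking loop. The benefit score below (expected movement of the prediction under imputed draws of the missing modality) is a hypothetical proxy, not CAMA's actual metric-based estimator, and all names are assumptions:

```python
def expected_benefit(unimodal_prob, imputed_probs):
    """Hypothetical per-sample benefit: how far the unimodal prediction is
    expected to move once the missing modality is filled in by draws from
    a generative imputation model."""
    avg_multimodal = sum(imputed_probs) / len(imputed_probs)
    return abs(avg_multimodal - unimodal_prob)

def rank_for_acquisition(samples, budget):
    """Spend the acquisition budget on the samples with the largest score.
    Each sample dict holds an id, a unimodal probability 'uni', and a list
    'imp' of predictions under imputed versions of the missing modality."""
    ranked = sorted(samples,
                    key=lambda s: expected_benefit(s["uni"], s["imp"]),
                    reverse=True)
    return [s["id"] for s in ranked[:budget]]
```

A sample whose imputed multimodal predictions disagree strongly with its unimodal prediction is the one most worth paying for, which is the intuition behind preferring imputation-based strategies over unimodal or entropy-based guidance.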


Token-Level Marginalization for Multi-Label LLM Classifiers

Praharaj, Anjaneya, Kasundra, Jaykumar

arXiv.org Artificial Intelligence

This paper addresses the critical challenge of deriving interpretable confidence scores from generative large language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and to evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.
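The basic idea of recovering a class-level probability from token logits can be sketched as follows. This is a generic illustration, not one of the paper's three approaches; the decision tokens "safe"/"unsafe" and the function names are assumptions:

```python
import math

def softmax(logits):
    """Convert a dict of token logits to probabilities (numerically stable)."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def unsafe_confidence(first_token_logits):
    """Class-level confidence derived from the first generated token's logits,
    renormalized over the 'safe'/'unsafe' decision tokens only."""
    p = softmax(first_token_logits)
    return p["unsafe"] / (p["safe"] + p["unsafe"])
```

Renormalizing over just the decision tokens yields a calibrated-looking score in [0, 1] that can be compared against a moderation threshold, which is exactly the kind of dynamic thresholding the abstract says plain generative output makes difficult.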